Add the possibility to compute statistics #369
This adds the possibility to compute statistics (counting all the runs of a property, along with the number of times it fails and the number of times it triggers an exception) for all tests. The main goal is to provide a sound argument as to whether a given change has a (positive or negative) impact on the ability of a test to detect issues. It is an incomplete draft, but advanced enough that we can decide whether it should be completed.
As we have no statistics to back the assumption that this PR has no impact, it tries to be as non-intrusive as possible. Here is how it proposes to do this:
- statistics are enabled by setting the environment variable `QCHECK_STATISTICS_FILE` to the path of a file in which the stats will be recorded (I think that adding a command-line option would require duplicating `QCheck_base_runner.Raw.parse_cli`, which I would rather avoid)
- a couple of QCheck functions are “routed” through `Util` so that their behaviours can be adapted when computing stats (see the sketch below):
  - `fail_reportf` is reexported so that it returns `false` rather than raising an exception (so that failures and exceptions can be distinguished);
  - `make_neg_test` wraps `Test.make_neg` so that it uses `Test.make` when statistics are enabled (because tests will always report success, and count failures separately);
  - `make_test` wraps `Test.make`, for uniformity;
  - `run_tests_main` is duplicated in order to run each test separately when statistics are enabled, so that we can record the stats for each test separately, and to enable the long version of tests (so `$QCHECK_LONG_FACTOR` can be used to multiply the number of tests to run)

  Note that this routing could also be useful to add optional hooks before and after each test (what Weak test revisions #365 and Ephemeron test revision #367 have done by hand for weak and ephemeron tests).
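  To make this concrete, here is a minimal sketch of the routing. Only `QCheck.Test.make`, `QCheck.Test.make_neg` and `QCheck.Test.fail_reportf` are real QCheck entry points; the `Stats` module and the restriction to the `?name`/`?count` arguments are simplifications of mine, not the PR’s code:

  ```ocaml
  (* Stats are enabled iff the environment variable is set;
     the real code would also open the named file for recording. *)
  module Stats = struct
    let enabled = Sys.getenv_opt "QCHECK_STATISTICS_FILE" <> None
  end

  (* In stats mode, a negative test is built as a plain positive test:
     it always reports success and failures are counted separately. *)
  let make_neg_test ?name ?count arb prop =
    if Stats.enabled
    then QCheck.Test.make ?name ?count arb prop
    else QCheck.Test.make_neg ?name ?count arb prop

  (* Reexported fail_reportf: in stats mode it returns false instead
     of raising, so failures and exceptions stay distinguishable. *)
  let fail_reportf fmt =
    if Stats.enabled
    then Format.kasprintf (fun _ -> false) fmt
    else QCheck.Test.fail_reportf fmt
  ```

  A test binary built on top of this would then be run as, e.g., `QCHECK_STATISTICS_FILE=stats.txt QCHECK_LONG_FACTOR=10 ./test.exe` (file name illustrative).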
- most of the counting logic is done in `Util.repeat`: when stats are enabled, it uses a different loop that counts iterations and negative outcomes rather than stopping at the first failure; this means that even the tests not using repetitions are now wrapped into a `repeat 1` to get statistics working (see the sketch after this list)
- `Util.Stats.enabled` is exposed so that tests can choose a different repetition count depending on it.
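A rough sketch of `repeat`’s two behaviours, reusing the `Stats.enabled` flag from the sketch above and two hypothetical global counters (the actual bookkeeping in the PR differs, since it records stats per test):

```ocaml
(* Hypothetical counters, for illustration only. *)
let failures = ref 0
let exceptions = ref 0

(* repeat : int -> ('a -> bool) -> 'a -> bool *)
let repeat n prop input =
  if not Stats.enabled then begin
    (* normal mode: stop at the first failing iteration *)
    let rec loop i = i <= 0 || (prop input && loop (i - 1)) in
    loop n
  end
  else begin
    (* stats mode: run all n iterations, counting failures and
       exceptions instead of stopping early *)
    for _ = 1 to n do
      (match prop input with
       | true -> ()
       | false -> incr failures
       | exception _ -> incr exceptions)
    done;
    (* the test always reports success; negative outcomes are only
       visible in the recorded statistics *)
    true
  end
```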
A possible alternate design, which would not rely quite so much on global state in `Util` (and would thus make it possible to use `run_tests` without splitting test lists), would be to have `Util.make_test` and `Util.make_neg_test` expect a `Stats.t -> 'a -> bool` function (which would be the result of `repeat`, with the added benefit that the change of type detects places that are not updated...) and to generate fresh stats for each test. It is not yet in this version of the PR because I’m not sure which one is the best choice.
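Under that alternate design, the `Util` interface might look something like this; this is my reading of the proposal, not code from the PR, and `Stats.fresh` is a hypothetical constructor:

```ocaml
module type UTIL_ALTERNATE = sig
  module Stats : sig
    type t
    val fresh : unit -> t  (* hypothetical: one fresh Stats.t per test *)
  end

  (* The property receives its per-test stats explicitly, so the type
     change flags every call site that has not been updated. *)
  val make_test :
    ?name:string -> ?count:int ->
    'a QCheck.arbitrary -> (Stats.t -> 'a -> bool) -> QCheck.Test.t

  val make_neg_test :
    ?name:string -> ?count:int ->
    'a QCheck.arbitrary -> (Stats.t -> 'a -> bool) -> QCheck.Test.t
end
```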
To do before merging, if this proposal is good enough:

- … `Util` to avoid useless history noise,
- … (`repeat` and `make_test` signatures),
- … `Util` or the changes of behaviours,

To do in a later PR:

- …